Using Deep-Learned Vector Representations for Page Stream Segmentation by Agglomerative Clustering
Authors
Abstract
Page stream segmentation (PSS) is the task of retrieving the boundaries that separate source documents given a consecutive stream of documents (for example, sequentially scanned PDF files). The task has recently gained more interest as a result of the digitization efforts of various companies and organizations, as they move towards having all their documents available online for improved searchability and accessibility for users. The current state-of-the-art approach is neural start-of-document page classification on representations of the text and/or images of pages, using models such as Visual Geometry Group-16 (VGG-16) and BERT to classify individual pages. We view PSS as a clustering task instead, hypothesizing that pages from one document are similar to each other and different from pages in other documents, something that is difficult to incorporate in the current approaches. We compare the segmentation performance of an agglomerative clustering method with a binary classification model based on images, on a new publicly available dataset, and experiment with using either pretrained or finetuned image vectors as inputs to the model. To adapt the clustering method to PSS, we propose the use of a switch method to alleviate the effects of pages of the same class having a high similarity, and report an improvement in the scores using this method. Unfortunately, neither clustering with pretrained embeddings nor clustering with finetuned embeddings outperformed start-of-document page classification for PSS. However, clustering with either pretrained or finetuned embeddings is substantially more effective than the baseline, with finetuned embeddings outperforming pretrained embeddings. Finally, having the number of documents K as part of the input, in our use case a realistic assumption, has a surprisingly significant positive effect. In contrast to earlier papers, we evaluate PSS with the overlap-weighted partial match F1 score, developed as Panoptic Quality in the computer vision domain, a metric that is particularly well-suited to PSS, as it can be used to measure the quality of a document segmentation.
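As a rough illustration of the two ideas named in the abstract, the sketch below merges adjacent pages greedily until K segments remain and then scores the result with Panoptic Quality. It is not the authors' implementation: the centroid cosine linkage, the restriction to adjacent merges, and the names `segment_pages` and `panoptic_quality` are assumptions made for this example, and the switch method is not shown because the abstract does not describe it.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def segment_pages(page_vectors, k):
    """Hypothetical helper: agglomerative clustering restricted to adjacent
    segments, so every cluster is a contiguous run of pages.
    Assumes k (the number of documents) is known, as in the abstract."""
    segments = [[i] for i in range(len(page_vectors))]
    while len(segments) > max(k, 1):
        centroids = [mean_vector([page_vectors[i] for i in seg])
                     for seg in segments]
        # Pick the adjacent pair whose centroids are most similar.
        best = max(range(len(segments) - 1),
                   key=lambda j: cosine(centroids[j], centroids[j + 1]))
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
    return segments


def panoptic_quality(true_segments, pred_segments):
    """Panoptic Quality: sum of IoU over matched segments divided by
    |TP| + 0.5 * |FP| + 0.5 * |FN|. A true and a predicted segment match
    when their IoU exceeds 0.5, which makes the match unique when both
    segmentations partition the same pages."""
    true_sets = [set(s) for s in true_segments]
    pred_sets = [set(s) for s in pred_segments]
    iou_sum, tp = 0.0, 0
    for t in true_sets:
        for p in pred_sets:
            iou = len(t & p) / len(t | p)
            if iou > 0.5:
                iou_sum += iou
                tp += 1
                break
    fp = len(pred_sets) - tp
    fn = len(true_sets) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Toy page embeddings: two pages of one document, then three of another.
pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
predicted = segment_pages(pages, k=2)
score = panoptic_quality([[0, 1], [2, 3, 4]], predicted)
```

Because a partially correct segment still contributes its IoU, this metric rewards near-miss boundaries instead of scoring them as complete failures, which is the property the abstract highlights.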
Similar resources
Agglomerative connectivity constrained clustering for image segmentation
We consider the problem of clustering under the constraint that data points in the same cluster are connected according to a pre-existing graph. This constraint can be efficiently addressed by an agglomerative clustering approach, which we exploit to construct a new fully automatic segmentation algorithm for color photographs. For image segmentation, if the pixel grid with eight neighbor connect...
Color Image Segmentation Using Anisotropic Diffusion and Agglomerative Hierarchical Clustering
A new color image segmentation scheme is presented in this paper. The proposed algorithm consists of image simplification, region labeling and color clustering. The vector-valued diffusion process is performed in the perceptually uniform LUV color space. We present a discrete 3-D diffusion model for easy implementation. The statistical characteristics of each labeled region are employed to esti...
Segmentation of Expository Texts by Hierarchical Agglomerative Clustering
We propose a method for segmentation of expository texts based on hierarchical agglomerative clustering. The method uses paragraphs as the basic segments for identifying hierarchical discourse structure in the text, applying lexical similarity between them as the proximity test. Linear segmentation can be induced from the identified structure through application of two simple rules. However t...
Agglomerative Clustering Using Asymmetric Similarities
Algorithms of agglomerative hierarchical clustering using asymmetric similarity measures are studied. Two different measures between two clusters are proposed, one of which generalizes the average linkage for symmetric similarity measures. Asymmetric dendrogram representation is considered after foregoing studies. It is proved that the proposed linkage methods for asymmetric measures have no re...
Journal
Journal title: Algorithms
Year: 2023
ISSN: 1999-4893
DOI: https://doi.org/10.3390/a16050259